Introduction
The purpose of this study is to examine Spotify’s music library in terms of various factors, including popularity, genres, and releases. As a result, we will gain insight into the business strategy of the company.
In this report, we intend to answer the following questions:
In this project, the following R packages were used for data analysis:
| Library | Description |
|---|---|
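| ‘ggplot2’ | an R package for data visualization. |
| ‘dplyr’ | an R package for data wrangling. |
| ‘tidyverse’ | a collection of R packages for data wrangling and manipulation (includes ‘readr’ and ‘tidyr’, which are used below). |
| ‘knitr’ | an R package that provides an engine for dynamic report generation (source of `kable()`). |
| ‘DT’ | an R package for interactive tables (source of `datatable()`). |
| ‘corrplot’ | an R package for visualizing correlation matrices (source of `corrplot()`). |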
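The report does not show how the packages are loaded. The following is a minimal sketch of one way to do it, assuming the packages come from CRAN; the names `required_pkgs` and `missing_pkgs` are illustrative and not part of the original analysis.

```r
# Packages assumed by the code in this report; tidyverse attaches
# ggplot2, dplyr, tidyr and readr among others.
required_pkgs <- c("tidyverse", "knitr", "DT", "corrplot")

# Find any that are not installed yet.
missing_pkgs <- setdiff(required_pkgs, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach everything so kable(), datatable(), corrplot() and %>% are available.
  invisible(lapply(required_pkgs, library, character.only = TRUE))
}
```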
We obtained the Spotify songs data from the TidyTuesday GitHub repository (rfordatascience/tidytuesday).
The Spotify data includes 32833 tracks and 23 attributes, such as track_popularity, danceability, loudness, and tempo, for songs released between the late 1950s and 2019.
The following table illustrates the first five rows of raw data:
spotify_data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
kable(head(spotify_data,5))
| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
Data Structure
Spotify data includes 32833 tracks and 23 attributes.
print(paste('The data has',dim(spotify_data)[1],'rows and',dim(spotify_data)[2],'attributes'))
## [1] "The data has 32833 rows and 23 attributes"
Of the 23 variables, 10 are character variables and the remaining 13 are numerical. Below is a description of the Spotify data.
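# NOTE: this URL is the songs file itself, so the table below shows the raw data,
# not a separate data dictionary; substitute a dictionary file if one is available.
spotify_data_dictionary <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
# Render an interactive, paginated table with the DT package.
datatable(spotify_data_dictionary, options = list(
  autoWidth = TRUE,
  columnDefs = list(list(className = 'dt-center', targets = 3)),
  pageLength = 25,
  lengthMenu = c(5, 10, 15, 20, 25)
))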
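The split between character and numerical columns quoted above can be checked by tabulating the class of each column. This is a small sketch on a made-up data frame (`toy` and its values are hypothetical); applied to `spotify_data`, the same call should reproduce the 10/13 split.

```r
# Hypothetical four-column stand-in for spotify_data.
toy <- data.frame(
  track_name     = c("a", "b"),
  playlist_genre = c("pop", "rock"),
  tempo          = c(120.5, 99.9),
  key            = c(6, 11),
  stringsAsFactors = FALSE
)

# Take the first class of every column and count how often each occurs.
type_counts <- table(vapply(toy, function(col) class(col)[1], character(1)))
print(type_counts)
```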
Below are summary statistics for the numerical variables in the Spotify data. We will then look more closely at the data to check for outliers and missing values.
summary(spotify_data[,c(4,12:23)])
## track_popularity danceability energy key
## Min. : 0.00 Min. :0.0000 Min. :0.000175 Min. : 0.000
## 1st Qu.: 24.00 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000
## Median : 45.00 Median :0.6720 Median :0.721000 Median : 6.000
## Mean : 42.48 Mean :0.6548 Mean :0.698619 Mean : 5.374
## 3rd Qu.: 62.00 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000
## Max. :100.00 Max. :0.9830 Max. :1.000000 Max. :11.000
## loudness mode speechiness acousticness
## Min. :-46.448 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: -8.171 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151
## Median : -6.166 Median :1.0000 Median :0.0625 Median :0.0804
## Mean : -6.720 Mean :0.5657 Mean :0.1071 Mean :0.1753
## 3rd Qu.: -4.645 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550
## Max. : 1.275 Max. :1.0000 Max. :0.9180 Max. :0.9940
## instrumentalness liveness valence tempo
## Min. :0.0000000 Min. :0.0000 Min. :0.0000 Min. : 0.00
## 1st Qu.:0.0000000 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96
## Median :0.0000161 Median :0.1270 Median :0.5120 Median :121.98
## Mean :0.0847472 Mean :0.1902 Mean :0.5106 Mean :120.88
## 3rd Qu.:0.0048300 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92
## Max. :0.9940000 Max. :0.9960 Max. :0.9910 Max. :239.44
## duration_ms
## Min. : 4000
## 1st Qu.:187819
## Median :216000
## Mean :225800
## 3rd Qu.:253585
## Max. :517810
Let us first examine the columns of the dataset.
glimpse(spotify_data)
## Rows: 32,833
## Columns: 23
## $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
## $ track_name <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
## $ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
## $ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
## $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
## $ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
## $ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
## $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
## $ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
## $ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "dance…
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
## $ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
## $ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
## $ duration_ms <dbl> 194754, 162600, 176616, 169093, 189052, 16304…
Secondly, we analyze the number of missing values per column so that we can determine whether they should be dropped, retained, or imputed with mean/median values.
colSums(is.na(spotify_data))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
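# Show every row that has at least one missing value
# (if_any() is the current dplyr replacement for filter_all(any_vars(...))).
spotify_data %>%
  filter(if_any(everything(), is.na))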
## # A tibble: 5 × 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 69gRFGOWY9OMpFJgFol1u0 <NA> <NA> 0 717UG2du6utFe…
## 2 5cjecvX0CmC9gK0Laf5EMQ <NA> <NA> 0 3luHJEPw434tv…
## 3 5TTzhRSWQS4Yu8xTgAuq6D <NA> <NA> 0 3luHJEPw434tv…
## 4 3VKFip3OdAvv4OfNTgFWeQ <NA> <NA> 0 717UG2du6utFe…
## 5 69gRFGOWY9OMpFJgFol1u0 <NA> <NA> 0 717UG2du6utFe…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
## # playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## # playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## # loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## # duration_ms <dbl>
The data contains missing values in three variables (track_name, track_artist, and track_album_name). They all occur in the same 5 rows, which represent less than 0.1% of the data. We have therefore decided to remove the rows with NA values, as this will not materially affect our analysis.
spotify_data <- spotify_data %>% drop_na()
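The share of affected rows can be computed directly instead of estimated by eye. The sketch below uses a made-up data frame (`toy` is hypothetical) and then repeats the arithmetic with the row counts reported above.

```r
# Hypothetical data: one of four rows has missing values.
toy <- data.frame(
  track_id     = c("a", "b", "c", "d"),
  track_name   = c("x", NA, "y", "z"),
  track_artist = c("p", NA, "q", "r")
)

# complete.cases() is TRUE for rows without any NA.
rows_with_na <- sum(!complete.cases(toy))
pct_with_na  <- 100 * rows_with_na / nrow(toy)

# Same calculation with the figures from this report: 5 of 32833 rows.
report_share <- 100 * 5 / 32833   # about 0.015 percent
```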
In the data dictionary, “track_id” is described as an identifier for the songs in the data set, so we checked the “track_id” column for duplicates. Of the 32828 rows remaining after the NA removal, 4476 were duplicates and were dropped (4481 rows removed in total, counting the 5 rows with missing values). The cleaned data has 28352 rows and 23 attributes.
spotify_data %>% distinct(track_id, .keep_all=TRUE) %>% dim()
## [1] 28352 23
spotify_data <- spotify_data %>% distinct(track_id,.keep_all=TRUE)
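Before dropping duplicates it can help to see which identifiers repeat and how many rows will be lost. The sketch below uses made-up data (`toy` is hypothetical). It assumes duplicates arise because one track sits in several playlists, which is plausible for this data set but not verified here. Note that `distinct(..., .keep_all = TRUE)` keeps the first row for each identifier, so playlist columns reflect only that first occurrence.

```r
library(dplyr)

# Hypothetical data: t1 appears in three playlists, t2 in two.
toy <- tibble(
  track_id      = c("t1", "t2", "t1", "t3", "t2", "t1"),
  playlist_name = c("A",  "A",  "B",  "B",  "C",  "C")
)

# Identifiers that occur more than once, with their row counts.
dup_summary <- toy %>%
  count(track_id, name = "n_rows") %>%
  filter(n_rows > 1)

# Rows that deduplication will remove.
rows_dropped <- nrow(toy) - n_distinct(toy$track_id)

# Keep the first row per track_id, as in the report.
deduped <- toy %>% distinct(track_id, .keep_all = TRUE)
```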
The “duration_ms” column is given in milliseconds, which is not a convenient unit for the duration of songs. Therefore, we created a new variable called “duration_m”, which records the duration of the songs in minutes.
spotify_data <- spotify_data %>% mutate(duration_m = duration_ms/60000)
spotify_data <- select(spotify_data, -duration_ms)
colnames(spotify_data)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_m"
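As a worked example of the conversion, the sketch below takes the first three durations from the raw data preview, converts them to decimal minutes, and also builds a minutes:seconds label. The label and the names `total_sec` and `duration_label` are illustrative additions, not part of the original analysis.

```r
# First three duration_ms values from the raw data preview.
duration_ms <- c(194754, 162600, 176616)

# Decimal minutes, as stored in duration_m.
duration_m <- duration_ms / 60000

# Whole seconds, then split into minutes and remaining seconds.
total_sec <- duration_ms %/% 1000
duration_label <- sprintf("%d:%02d",
                          as.integer(total_sec %/% 60),
                          as.integer(total_sec %% 60))
```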
Upon reviewing the data, we find that track popularity varies with both time and genre, and we will evaluate this relationship further in the exploratory data analysis section. To support a yearly trend analysis rather than a day-by-day one, we extract the year from the “track_album_release_date” column and store it in a new variable, “track_album_release_year”.
spotify_data$track_album_release_date <- as.Date(spotify_data$track_album_release_date)
spotify_data$track_album_release_year <- as.numeric(format(spotify_data$track_album_release_date, "%Y"))
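The year extraction above relies on every release date being a full `YYYY-MM-DD` string. If some dates are stored at year or month precision only (an assumption about the raw file that is worth checking), `as.Date()` returns `NA` for them and the year is lost. A hedged alternative is to read the year from the first four characters of the original string, sketched here on made-up values:

```r
# Hypothetical release dates with mixed precision.
release_date <- c("2019-06-14", "2012", "1998-03")

# The first four characters hold the year regardless of precision.
release_year <- as.integer(substr(release_date, 1, 4))

# Count how many values are not full dates, to judge whether this matters.
n_partial <- sum(nchar(release_date) != 10)
```

This has to run on the character column, that is, before it is overwritten by `as.Date()`.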
track_popularity_uniques <- spotify_data %>% distinct(track_popularity) %>% select(track_popularity)
tags <- c("[0-20]","(20-40]", "(40-60]", "(60-80]", "(80-100]", "(100+]")
spotify_data_binned <- spotify_data %>%
mutate(track_popularity_tag = case_when(
track_popularity <= 20 ~ tags[1],
track_popularity > 20 & track_popularity <= 40 ~ tags[2],
track_popularity > 40 & track_popularity <= 60 ~ tags[3],
track_popularity > 60 & track_popularity <= 80 ~ tags[4],
track_popularity > 80 & track_popularity <= 100 ~ tags[5],
track_popularity > 100 ~ tags[6]
))
spotify_data_binned %>% distinct(track_popularity_tag)
## # A tibble: 5 × 1
## track_popularity_tag
## <chr>
## 1 (60-80]
## 2 (40-60]
## 3 (20-40]
## 4 [0-20]
## 5 (80-100]
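The binning step above groups popularity into 20-point intervals with `case_when()`. An equivalent and shorter formulation uses base R's `cut()`. The sketch below is an alternative, not the report's method, and it omits the "(100+]" label because popularity is capped at 100 in the summary statistics.

```r
# Same interval labels as in the report, without the unreachable "(100+]".
tags <- c("[0-20]", "(20-40]", "(40-60]", "(60-80]", "(80-100]")

# Hypothetical popularity scores, including the interval edges.
popularity <- c(0, 20, 21, 66, 100)

# Right-closed intervals; include.lowest = TRUE puts 0 into the first bin.
bins <- cut(popularity,
            breaks = seq(0, 100, by = 20),
            labels = tags,
            include.lowest = TRUE,
            right = TRUE)
as.character(bins)
```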
To determine if outliers in the dataset should be removed, retained or imputed, we plot the boxplots below for each of the numerical attributes.
spotify_pivot <- spotify_data_binned %>%
  select(danceability:tempo) %>%
  pivot_longer(cols = everything(), names_to = "Var", values_to = "val")
ggplot(spotify_pivot, aes(y = val, fill = Var))+
geom_boxplot(show.legend = FALSE, width = .6, position = "dodge")+
coord_flip() +
facet_wrap(vars(Var), ncol=3, scales = "free") + scale_fill_grey()
Without domain expertise, we do not remove these outliers at this point, because they may carry information about what makes tracks popular with audiences.
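To put numbers on what the boxplots show, outliers can be flagged with the 1.5 × IQR rule, the convention `geom_boxplot()` uses by default when it draws individual points. This is a minimal sketch; `flag_outliers` is a hypothetical helper and the input vector is made up.

```r
# Flag values more than coef * IQR beyond the first or third quartile.
flag_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- coef * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Hypothetical values with one obvious extreme.
x <- c(1, 2, 3, 4, 5, 100)
is_outlier <- flag_outliers(x)
sum(is_outlier)
```

Applied per column (for example with `sapply()` over the numeric attributes), this gives an outlier count for each variable without removing anything.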
The following histograms illustrate the skewness of the data.
ggplot(spotify_pivot, aes(x = val, fill = Var))+
geom_histogram(show.legend = FALSE, position = "dodge") +
facet_wrap(vars(Var), ncol=3, scales = "free") + scale_fill_grey()
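Skewness can also be summarized numerically to complement the histograms. The sketch below uses the moment-based definition (third central moment divided by the cubed standard deviation, both computed with a 1/n divisor). `skewness` is a hypothetical helper defined here, not a function from the packages listed earlier.

```r
# Moment-based sample skewness; 0 for symmetric data, > 0 for a long right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

symmetric    <- skewness(c(1, 2, 3, 4, 5))
right_skewed <- skewness(c(1, 1, 1, 1, 10))
```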
The following is the final preview of the cleaned data.
spotify_data_cleaned <- spotify_data_binned
datatable(head(spotify_data_cleaned, 25), options = list(
scrollCollapse = TRUE,scrollX = TRUE,
autoWidth = TRUE,
columnDefs = list(list(className = 'dt-center', targets = 5)),
pageLength = 5,
lengthMenu = c(5, 10, 15, 20, 25)
))
As a first step, we introduce and analyze the user behavior patterns in detail; we then propose statistical models (for example, regression models) to answer our research questions and gain further insight.
First, we examine the pairwise correlations between the song attributes to see which attributes are strongly related to each other and to track popularity:
corr_data <-select(spotify_data_cleaned,track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo)
corrplot(cor(corr_data), tl.col = 'black')
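A correlation plot shows the strength and direction of each pairwise relationship but not whether it is statistically significant. For a specific pair, `cor.test()` from base R reports a p-value and a confidence interval. The sketch below uses synthetic data (`x` and `y` are made up); in the report the inputs would be two columns of `corr_data`.

```r
# Synthetic, strongly related variables.
x <- 1:20
y <- 2 * x + rep(c(-1, 1), times = 10)

# Pearson correlation with a test of the null hypothesis of zero correlation.
res <- cor.test(x, y, method = "pearson")

res$estimate   # correlation coefficient
res$p.value    # p-value for the test
res$conf.int   # 95% confidence interval
```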
We plot the number of tracks released per genre from 2006 through 2019 (the filter below keeps years after 2005 and up to 2019) and develop insights from this data.
song_years_genre_df <- spotify_data_cleaned %>%
filter(track_album_release_year> 2005 & track_album_release_year<=2019)%>%
select('track_album_release_year', 'playlist_genre') %>%
group_by(track_album_release_year, playlist_genre) %>%
summarise(songs_released = n()) %>%
ungroup()
ggplot(song_years_genre_df, aes(x = track_album_release_year, y = songs_released)) +
geom_line(aes(color = playlist_genre)) +
ggtitle("The number of songs released by each genre over the years") +
ylab("Song releases") +xlab("Year of release")
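The yearly counts per genre can be written more compactly with `count()`, which combines grouping, counting and ungrouping. This sketch on made-up data (`toy` is hypothetical) also makes the year window explicit with `between()`, which includes both endpoints.

```r
library(dplyr)

# Hypothetical tracks; 2005 and 2020 fall outside the window.
toy <- tibble(
  track_album_release_year = c(2005, 2006, 2006, 2019, 2019, 2020),
  playlist_genre           = c("pop", "pop", "pop", "rock", "rock", "rock")
)

# Keep 2006 to 2019 inclusive, then count rows per year and genre.
counts <- toy %>%
  filter(between(track_album_release_year, 2006, 2019)) %>%
  count(track_album_release_year, playlist_genre, name = "songs_released")
```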
We have briefly examined the user behavior patterns. Next, we will examine them in more detail and develop statistical models (for example, regression models) to answer our research questions.